Abstract

Many fundamental multi-processor coordination problems can be expressed as counting problems: 
processes must cooperate to assign successive values from a given range, such as addresses in memory or 
destinations on an interconnection network. Conventional solutions to these problems perform poorly 
because of synchronization bottlenecks and high memory contention. 

Motivated by observations on the behavior of sorting networks, we offer a completely new approach 
to solving such problems. We introduce a new class of networks called counting networks, i.e., networks 
that can be used to count. We give a counting network construction of depth $\log^2 n$ using $n \log^2 n$
"gates," avoiding the sequential bottlenecks inherent to former solutions, and having a provably lower 
contention factor on its gates. 

Finally, to show that counting networks are not merely mathematical creatures, we provide ex- 
perimental evidence that they outperform conventional synchronization techniques under a variety of 
circumstances. 
1 Introduction



Many fundamental multi-processor coordination problems can be expressed as counting problems: pro- 
cessors collectively assign successive values from a given range, such as addresses in memory or desti- 
nations on an interconnection network. In this paper, we offer a completely new approach to solving 
such problems, by introducing counting networks, a new class of networks that can be used to count. 

Counting networks, like sorting networks [2, 4, 5], are constructed from simple two-input two-output 
computing elements called balancers, connected to one another by wires. However, while an n input 
sorting network sorts a collection of n input values only if they arrive together, on separate wires, 
and propagate through the network in lockstep, a counting network can count any number $N \gg n$ of
input values even if they arrive at arbitrary times, are distributed unevenly among the input wires, and 
propagate through the network asynchronously. 

Figure 2 provides an example of an execution of a 4-input, 4-output, counting network. A balancer 
is represented by two dots and a vertical line (see Figure 1). Intuitively, a balancer is just a toggle 
mechanism^1, repeatedly sending the inputs it receives, one to the left and one to the right. It thus
balances the number of values on its output wires. In the example of Figure 2, input values arrive on 
the network's input lines one after the other. For convenience we have numbered them by the order of 
their arrival (these numbers are not used by the network). As can be seen, the first input (numbered 1) 
enters on line 2 and leaves on line 1, the second leaves on line 2, and in general, the Nth value will leave
on line N mod 4. (The reader is encouraged to try this for him/herself.) Thus, if on the ith output line
the network assigns to consecutive outputs the numbers i, i + 4, i + 2·4, ..., it is counting the number of
input values without actually passing them all through a shared computing element! 

Counting networks achieve a high level of throughput by decomposing interactions among processes 
into pieces that can be performed in parallel. This decomposition has two performance benefits: It 
eliminates serial bottlenecks and reduces memory contention. In practice, the performance of many 
shared-memory algorithms is often limited by conflicts at certain widely-shared memory locations, 
often called hot spots [19]. Reducing hot-spot conflicts has been the focus of hardware architecture 
design [1, 8, 12, 14, 11] and experimental work in software [3, 9, 10, 16, 20]. 

Counting networks are also non-blocking: processes that undergo halting failures or delays while 
using a counting network do not prevent other processes from making progress. This property is im- 
portant because existing shared-memory architectures are themselves inherently asynchronous; process 
step times are subject to timing uncertainties due to variations in instruction complexity, page faults, 
cache misses, and operating system activities such as preemption or swapping. 

We show a depth $\log^2 n$ construction of a counting network, using $n \log^2 n$ balancers, and argue that
our construction produces low levels of contention; we feel that many other concurrent shared-memory 
algorithms would benefit from a similar contention analysis. 

To illustrate the utility of counting networks, we show how to construct highly concurrent imple- 
mentations of two common data structures: shared counters and producer/consumer buffers. A shared 
counter is simply an object that issues the numbers 1 to n in response to n requests by processes. Shared 
counters are central to a number of shared-memory synchronization algorithms (e.g., [6, 12, 15, 20]). A 
producer/consumer buffer is a data structure in which items inserted by a pool of producer processes are 
removed by a pool of consumer processes. Compared to conventional techniques such as spin locks or 
semaphores, our counting network implementations provide higher throughput, less memory contention, 
and better tolerance for failures and delays. 

1 It is easy to implement a balancer using a Compare & Swap, Test & Set, or a randomized consensus primitive. 
Figure 1: A Balancer.



Our analysis of the counting network construction is supported by experiment. In the appendix, 
we compare the performance of several implementations of shared counters and producer/consumer 
buffers on an eighteen-processor Encore MultiMax. When the level of concurrency is sufficiently high, 
the counting network implementations outperform conventional implementations based on spin locks, 
sometimes dramatically. 

In summary, counting networks represent a new class of concurrent algorithms. They have a rich 
mathematical structure, they provide effective solutions to important problems, and they perform well 
in practice. We believe that counting networks have other potential uses, for example as interconnection 
networks [21] or as load balancers [18], and that they deserve further attention.



2 Networks that Count 
2.1 Counting Networks 

Counting networks belong to a larger class of networks called balancing networks, constructed from 
wires and computing elements called balancers, in a manner very similar to that in which comparison 
networks [5] are constructed from wires and comparators. We begin by describing balancing networks. 

A balancer is a computing element with two input wires and two output wires^2 (see Figure 1).
Tokens repeatedly arrive on one of the balancer's input wires, at arbitrary times, and are repeatedly
output on its output wires. Intuitively, one may think of a balancer as a toggle mechanism that, given
a stream of input tokens, repeatedly sends one token to the upper output wire and one to the lower,
effectively balancing the number of tokens on its output wires. We denote by $x_i$, $i \in \{0, 1\}$, the number
of input tokens ever received on the balancer's ith input wire, and similarly by $y_i$, $i \in \{0, 1\}$, the number
of tokens ever output on its ith output wire. Throughout the paper we will abuse this notation and
use $x_i$ ($y_i$) both as the name of the ith input (output) wire and as a count of the number of tokens
received (output) on the wire.

Let the state of a balancer at a given time be defined as the collection of tokens on its input and 
output wires. We can now formally state the safety and liveness properties of a balancer: 

1. In any state, $x_0 + x_1 \ge y_0 + y_1$ (i.e. a balancer never creates output tokens).

2. Given any finite number of input tokens $m = x_0 + x_1$ to the balancer, it is guaranteed that within
a finite amount of time, it will reach a quiescent state, that is, one in which $x_0 + x_1 = y_0 + y_1 = m$
(i.e. a balancer never swallows input tokens).

3. In any quiescent state, $y_0 = \lceil m/2 \rceil$ and $y_1 = \lfloor m/2 \rfloor$.

4. In any quiescent state the sets of input tokens and output tokens are the same.

2 In Figure 1 as well as in the sequel, we adopt the notation of [5] and draw wires as horizontal lines with balancers
stretched vertically.
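The balancer properties listed above can be exercised with a small sketch. This is a minimal sequential model, assuming tokens are handled one at a time; the class name `Balancer`, its fields, and the helper `check_quiescent` are illustrative names, not taken from the paper.

```python
class Balancer:
    """Toggle element: routes successive tokens to output 0, 1, 0, 1, ..."""

    def __init__(self):
        self.x = [0, 0]   # x[i]: tokens received so far on input wire i
        self.y = [0, 0]   # y[i]: tokens emitted so far on output wire i
        self.toggle = 0   # output wire the next token will take

    def traverse(self, input_wire):
        """Accept one token on input_wire; return the output wire it leaves on."""
        self.x[input_wire] += 1
        out = self.toggle
        self.toggle ^= 1
        self.y[out] += 1
        return out


def check_quiescent(b):
    """Check properties 1-3 in a quiescent state (no token inside the balancer)."""
    m = sum(b.x)
    assert sum(b.x) >= sum(b.y)      # property 1: no tokens are created
    assert sum(b.y) == m             # property 2: no tokens are swallowed
    assert b.y[0] == (m + 1) // 2    # property 3: y0 = ceil(m/2)
    assert b.y[1] == m // 2          #             y1 = floor(m/2)
    return True


b = Balancer()
for wire in [0, 0, 1, 0, 1]:         # the arrival pattern is arbitrary
    b.traverse(wire)
    check_quiescent(b)
```

Note that the output wire depends only on how many tokens have passed, never on which input wire a token arrived on.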

A balancing network of width w is a collection of balancers, where output wires are connected to input
wires, having w designated input wires $x_0, x_1, \ldots, x_{w-1}$ (which are not connected to output wires of
balancers), w designated output wires $y_0, y_1, \ldots, y_{w-1}$ (similarly unconnected), and containing no cycles.
Let the state of a network at a given time be defined as the union of the states of all its component
balancers. The safety and liveness of the network follow naturally from the above network definition
and the properties of balancers, namely, that it is always the case that $\sum_{i=0}^{w-1} x_i \ge \sum_{i=0}^{w-1} y_i$, and for
any finite sequence of m input tokens, within finite time the network reaches a quiescent state, i.e. one
in which $\sum_{i=0}^{w-1} y_i = m$.

It is important to note that we make no assumptions regarding the timing of token transitions from 
balancer to balancer in a balancing network — its behavior can be viewed as a completely asynchronous 
process, and is defined in the usual way by a schedule. 

To give the reader a feeling of what the above abstraction might represent, consider an implemen- 
tation on a shared memory multiprocessor. A balancing network is implemented as a shared data 
structure, where balancers are records and wires are pointers from one record to another. Each of the 
machine's asynchronous processors runs a program that repeatedly traverses the data structure from 
some input pointer to some output pointer, each time shepherding a new token through the network. 
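The shared-memory picture above can be sketched as follows. This is a minimal sketch under stated assumptions: each balancer record guards its toggle with a `threading.Lock` (standing in for the atomic primitives mentioned in footnote 1), and the width-4 wiring in `build_width4` is our own hand-wired reading of the construction given later in Section 3. All names are hypothetical.

```python
import threading


class Balancer:
    """A record holding a toggle bit and two pointers (its output wires)."""

    def __init__(self, top, bottom):
        self.lock = threading.Lock()
        self.toggle = 0
        self.out = (top, bottom)   # each is another Balancer or an output index

    def next_hop(self):
        with self.lock:            # read and flip the toggle atomically
            i = self.toggle
            self.toggle ^= 1
        return self.out[i]


def traverse(entry):
    """Shepherd one token from an input pointer to an output wire index."""
    node = entry
    while isinstance(node, Balancer):
        node = node.next_hop()
    return node


def build_width4():
    """Hand-wired width-4 network; returns the entry record for each input wire."""
    f0, f1 = Balancer(0, 1), Balancer(2, 3)
    a, b = Balancer(f0, f1), Balancer(f0, f1)
    c, d = Balancer(a, b), Balancer(b, a)
    return [c, c, d, d]


def run(num_threads=4, tokens_per_thread=250):
    """Each thread repeatedly pushes tokens in on its own input wire."""
    inputs = build_width4()
    counts = [0, 0, 0, 0]
    counts_lock = threading.Lock()

    def worker(wire):
        for _ in range(tokens_per_thread):
            out = traverse(inputs[wire])
            with counts_lock:
                counts[out] += 1

    threads = [threading.Thread(target=worker, args=(t % 4,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts                  # per-output token counts once quiescent
```

The order in which individual tokens exit varies from run to run, but the per-wire totals in the quiescent state do not, because each balancer's split depends only on how many tokens it has received.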

We define the depth of a balancing network to be the maximal depth of any wire, where the depth of
a wire is defined as 0 for a network input wire, and $\max(\mathrm{depth}(x_0), \mathrm{depth}(x_1)) + 1$ for the output wires
of a balancer having input wires $x_0$ and $x_1$.

A counting network of width w is a balancing network whose outputs $y_0, \ldots, y_{w-1}$ have the
following additional step property in quiescent states:

In any quiescent state, $0 \le y_i - y_j \le 1$ for any $i < j$.

To illustrate this property, consider an execution in which tokens traverse the network sequentially, 
one completely after the other. Figure 2 shows such an execution on a Counter[4] network which we 
will define formally in Section 3. As can be seen, the network moves input tokens to output tokens 
in increasing order modulo w. Balancing networks having this property are called counting networks, 
because we can easily construct from them counters which count the total number of tokens that 
have passed through, or are currently in, the network. Counting is done by adding a "local counter" 
to each output wire i, so that tokens coming out of that wire are consecutively assigned numbers 
$i, i + w, i + 2w, \ldots, i + (y_i - 1)w$. (This application is described in greater detail in Section 4.)

The step property can be defined in a number of ways which we will use interchangeably. The 
connection between them is stated in the following lemma: 

Lemma 2.1 If $y_0, \ldots, y_{w-1}$ is a sequence of non-negative integers, the following statements are all
equivalent:

1. For any $i < j$, $0 \le y_i - y_j \le 1$.

2. Either $y_i = y_j$ for all i, j, or there exists some c such that for any $i < c$ and $j \ge c$, $y_i - y_j = 1$.

3. If $m = \sum_{i=0}^{w-1} y_i$, then $y_i = \lceil (m - i)/w \rceil$.

Figure 2: A sequential execution for a Counter[4] counting network.

It is the third form of the step property that makes counting networks usable as counters. 
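The three forms of the step property in Lemma 2.1 can be written as predicates and compared by brute force. This is an illustrative sketch; the function names are ours, and in the second form we assume the split point c satisfies 0 < c < w, since otherwise the condition would hold vacuously.

```python
from itertools import product


def step_v1(y):
    """Form 1: 0 <= y_i - y_j <= 1 for all i < j."""
    w = len(y)
    return all(0 <= y[i] - y[j] <= 1 for i in range(w) for j in range(i + 1, w))


def step_v2(y):
    """Form 2: all equal, or some split point c with y_i - y_j = 1 across it."""
    w = len(y)
    if all(v == y[0] for v in y):
        return True
    return any(
        all(y[i] - y[j] == 1 for i in range(c) for j in range(c, w))
        for c in range(1, w)       # assumption: 0 < c < w
    )


def step_v3(y):
    """Form 3: y_i = ceil((m - i) / w), where m is the total."""
    w, m = len(y), sum(y)
    return all(y[i] == -((i - m) // w) for i in range(w))   # integer ceiling


def forms_agree(width, max_value):
    """Check that the three forms agree on every sequence in a small range."""
    return all(
        step_v1(y) == step_v2(y) == step_v3(y)
        for y in product(range(max_value + 1), repeat=width)
    )
```

The third form is the one a counter uses directly: given the total m, it determines exactly how many tokens left on each wire.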

The requirement that the outputs of a quiescent counting network have the step property might 
appear to tell us very little about the behavior of a counting network during an asynchronous execution, 
but in fact it is surprisingly powerful. The reason is that even in a state in which many tokens are 
passing through the network, if no new tokens arrive the network must eventually settle into a quiescent 
state. This fact constrains the behavior of the network, and makes it possible to prove such important 
properties as the following: 



Lemma 2.2 Suppose that in a given execution, a counting network with outputs $y_0, \ldots, y_{w-1}$ is in a
state where m tokens have entered the network and m' tokens have left it. Then there exist non-negative
integers $d_i$, $0 \le i < w$, such that $\sum_{i=0}^{w-1} d_i = m - m'$ and $y_i + d_i = \lceil (m - i)/w \rceil$.



2.2 Counting vs. Sorting 

Given a balancing network and a comparison network, we will say that they are isomorphic if one can be 
constructed from the other by replacing balancers by comparators or vice versa. The counting network 
in this paper is isomorphic to the Bitonic sorting network of Batcher [4]. To see that constructing 
counting networks is a challenging task, consider the following theorem: 

Theorem 2.3 If a balancing network counts, then its isomorphic comparison network sorts, but not
vice versa.



Proof outline: The balancing networks isomorphic to the Even-Odd or Insertion sorting networks 
[5] are not counting networks. 

To prove the other direction, we construct a mapping from the comparison network transitions to 
the isomorphic balancing network transitions, so that if the balancing network counts, the comparison 
network sorts. 
By the 0-1 principle [5], a comparison network which sorts all sequences of 0's and 1's correctly sorts
all sequences. Take any arbitrary sequence of 0's and 1's as inputs to the comparison network, and for
the balancing network place a token on each 0 input wire and no token on each 1 input wire. If we run
both networks in lockstep, the balancing network will simulate the comparison network.

On every gate where two 0's meet in the comparison network, two tokens meet in the balancing
network, so a 0 leaves on each wire in the comparison network, and a token leaves on each wire in the
balancing network. On every gate where two 1's meet in the comparison network, no tokens meet in the
balancing network, so a 1 leaves on each wire in the comparison network, and no tokens leave in the
balancing network. On every gate where a 0 and a 1 meet in the comparison network, the 0 leaves on the
lower wire and the 1 on the upper wire, while in the balancing network the token leaves on the lower wire,
and no token on the upper wire.

If the balancing network is a counting network, i.e., it has the step property, then the comparison
network must have sorted the input sequence of 0's and 1's. ■

2.3 Verifying That a Network Counts 

The 0-1 law for comparison networks allows one to verify a supposed sorting network by testing it on 
a relatively small range of possible executions, namely, those generated by input sequences of zeroes 
and ones. Does a similar law exist for counting networks? The answer is mixed: on the one hand, 
it is possible to show that a counting network can be tested by considering only a finite subset of its 
infinitely many possible executions. On the other hand, the size of that finite subset is dependent on 
the network's depth, and therefore may be very large. 

We first prove that in testing a network, one need only consider sequential executions, that is, 
executions in which tokens enter and leave the network one completely after the other. 

Theorem 2.4 If a balancing network maintains the step property in all sequential executions, it maintains
it in all executions.

Thus the problem of testing a supposed counting network is reduced from examining all possible 
executions to examining all sequential executions. The problem can be reduced further by regarding the 
network as a finite-state automaton. Suppose we have a width-w network with a total of m balancers.
If the network is quiescent, we can describe its state completely by specifying for each balancer which of
its outputs the next token to arrive will appear on; thus the network has at most $2^m$ reachable quiescent
states. If we consider only sequential executions, we can treat the network as a finite-state machine 
whose states are the quiescent states and whose transitions correspond to running a token through the 
network starting at some input-stage balancer. In this representation, an execution may be described 
by specifying the sequence of input-stage balancers on which the tokens are introduced. 
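The finite-state view can be made concrete with a short sketch. It encodes a quiescent state as a tuple of toggle bits and explores all sequential executions by breadth-first search. Note that this uses a different bookkeeping from the lemma and theorem that follow: instead of bounding the number of tokens, it pairs each state with the token count modulo w and checks that the next token always exits on that wire, which is equivalent to the step property holding after every token of a sequential execution. The dictionary encoding and the width-4 wiring are our own assumptions.

```python
from collections import deque

# Hypothetical encoding: balancer k has two destinations (top, bottom);
# ("b", j) means balancer j and ("out", i) means output wire i.
COUNTER4 = {
    "inputs": [0, 0, 1, 1],            # input wire -> index of first balancer
    "balancers": [
        (("b", 2), ("b", 3)),
        (("b", 3), ("b", 2)),
        (("b", 4), ("b", 5)),
        (("b", 4), ("b", 5)),
        (("out", 0), ("out", 1)),
        (("out", 2), ("out", 3)),
    ],
}


def run_token(net, state, wire):
    """Run one token through a quiescent network: (new_state, exit_wire)."""
    toggles = list(state)
    kind, idx = "b", net["inputs"][wire]
    while kind == "b":
        bit = toggles[idx]
        toggles[idx] ^= 1
        kind, idx = net["balancers"][idx][bit]
    return tuple(toggles), idx


def counts_sequentially(net):
    """Explore every (state, tokens mod w) pair reachable sequentially.

    Returns (ok, explored): ok is False as soon as some token exits on a
    wire other than (number of earlier tokens) mod w.
    """
    w = len(net["inputs"])
    start = (tuple([0] * len(net["balancers"])), 0)
    seen, queue = {start}, deque([start])
    while queue:
        state, k = queue.popleft()
        for wire in range(w):
            nxt, exit_wire = run_token(net, state, wire)
            if exit_wire != k:
                return False, len(seen)
            node = (nxt, (k + 1) % w)
            if node not in seen:
                seen.add(node)
                queue.append(node)
    return True, len(seen)
```

The search visits at most $2^m \cdot w$ pairs for a network with m balancers, so it terminates quickly for small networks.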

Lemma 2.5 Let b be a sequence of input tokens of length n which takes the network from a reachable
state q back to the same state q. Then if the network counts all sequences of up to $2n + 2^m$ tokens, the
length of b is a multiple of w and exactly $n/w$ tokens leave on each output wire.

Based on the above lemma, we can now prove that 

Theorem 2.6 If a width-w balancing network with m balancers counts in all sequential executions in
which up to $3 \cdot 2^m$ tokens pass through the network, it is a counting network.
Figure 3: A Merger[8] balancing network.

Proof outline: By Theorem 2.4 it is enough to show that the network guarantees the step property
in sequential executions. Thus we may regard the network as a finite-state machine as in the preceding
lemma.

Consider an input sequence a of length greater than $2^m$. By the Pigeonhole Principle there exists
some subsequence b of length at most $2^m$ such that $a = a_0 b a_1$ and the state of the network after $a_0$ and
$a_0 b$ is the same. Thus we can remove b without affecting the behavior of the network on $a_0 a_1$. Since
Lemma 2.5 tells us that b contributes an equal number of tokens to each output, the network's output
on $a_0 b a_1$ will have the step property if and only if its output on $a_0 a_1$ does. Repeating such contractions
will eventually yield an input sequence of length less than $2^m$, for which the network guarantees the
step property. ■

Finally, we give a lower bound on the number of tokens required by a test as in Theorem 2.6.^3 Let
us construct a would-be counting network of the following form. Take two counting networks of width
w, labeling their outputs as $a_0, \ldots, a_{w-1}$ and $b_0, \ldots, b_{w-1}$, respectively. Combine the two networks by
running a balancer between $a_0$ and $b_{w-1}$ and a second balancer between $b_0$ and $a_{w-1}$. Now construct a
k-stage periodic balancing network of width 2w by joining k copies of the above network, the outputs
of each stage connected to the corresponding inputs of the next. We can now prove that:

Lemma 2.7 A periodic balancing network with k stages, constructed as above, will count in all executions
involving up to $O(2^k w)$ tokens, but is not a counting network.

3 A Bitonic Counting Network 

Counting networks, of course, would not be interesting if we could not exhibit an example of one. In 
this section we describe how to construct a counting network whose width is any power of 2. The 
layout of this network is isomorphic to Batcher's Bitonic sorting network [4, 5], though its behavior and 
correctness arguments are completely different. We give an inductive construction, as this will later aid 
us in proving its correctness. 

3 A similar counterexample can be constructed having any width, not just a power of 2.
Define the width-w balancing network Merger[w] as follows. It has two sequences of inputs of
length w/2, x and x', and a single sequence of outputs y, of length w. Merger[w] will be constructed
to guarantee that in a quiescent state where the sequences x and x' have the step property, y will also
have the step property, a fact which will be proved in the next section.

We define the network Merger[w] inductively (see example in Figure 3). Since w is a power of 2, we
will repeatedly use the notation 2k in place of w. When k is equal to 1, the Merger[2k] network consists
of a single balancer. For k > 1, we construct the Merger[2k] network from two Merger[k] networks and
k balancers. Using a Merger[k] network we merge the even subsequence $x_0, x_2, \ldots, x_{k-2}$ of x with the
odd subsequence $x'_1, x'_3, \ldots, x'_{k-1}$ of x' (i.e. the input to the Merger[k] network is
$x_0, \ldots, x_{k-2}, x'_1, \ldots, x'_{k-1}$), while with a second Merger[k] network we merge the odd subsequence
of x with the even subsequence of x'. Call the outputs of these two Merger[k] networks z and z'. The
final stage of the network combines z and z' by sending each pair of lines $z_i$ and $z'_i$ into a balancer
whose outputs yield $y_{2i}$ and $y_{2i+1}$.

The Merger[w] network consists of log w layers of w/2 balancers each. This Merger[w] network
guarantees the step property on its outputs only when its odd and even input subsequences also have
the step property, but we can guarantee this by providing those inputs as the outputs of smaller
counting networks. We define Counter[w] to be the network constructed by passing the outputs
from two Counter[w/2] networks into a Merger[w] network, where the induction is grounded in the
Counter[1] network which contains no balancers and simply passes its input directly to its output.
This construction gives us a network consisting of $\binom{\log w + 1}{2}$ layers each consisting of w/2 balancers.
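The inductive definition translates fairly directly into a recursive builder. The following is a sketch under our own conventions, not the authors' implementation: the network is built back to front, each function receives the destinations of its outputs and returns the entry points of its inputs, and the names `merger`, `counter`, `build_counter`, and `traverse` are hypothetical. Tokens are run sequentially, so no locking is shown.

```python
class Balancer:
    def __init__(self, top, bottom):
        self.toggle = 0
        self.out = (top, bottom)   # each is a Balancer or an output wire index


def merger(outputs, made):
    """Build Merger[2k] whose outputs feed the 2k destinations in `outputs`.

    Returns (x, x_prime): entry points for its two length-k input sequences.
    """
    k = len(outputs) // 2
    if k == 1:
        b = Balancer(outputs[0], outputs[1])
        made.append(b)
        return [b], [b]
    # Final layer: balancer i receives z_i and z'_i, yields y_2i and y_2i+1.
    final = [Balancer(outputs[2 * i], outputs[2 * i + 1]) for i in range(k)]
    made.extend(final)
    ax, axp = merger(final, made)  # merges even(x) with odd(x')
    bx, bxp = merger(final, made)  # merges odd(x) with even(x')
    x, xp = [None] * k, [None] * k
    x[0::2], x[1::2] = ax, bx
    xp[1::2], xp[0::2] = axp, bxp
    return x, xp


def counter(outputs, made):
    """Build Counter[w] feeding `outputs`; returns its w input entry points."""
    if len(outputs) == 1:
        return list(outputs)       # Counter[1]: a bare wire
    x, xp = merger(outputs, made)
    return counter(x, made) + counter(xp, made)


def build_counter(w):
    made = []
    inputs = counter(list(range(w)), made)
    return inputs, made


def traverse(entry):
    """Run one token; return (exit wire, number of balancers visited)."""
    node, hops = entry, 0
    while isinstance(node, Balancer):
        i = node.toggle
        node.toggle ^= 1
        node, hops = node.out[i], hops + 1
    return node, hops
```

Under this sketch, `build_counter(w)` should yield 6, 24, and 80 balancers for widths 4, 8, and 16, matching (w/2) times the layer count above, and every token crosses exactly one balancer per layer.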



3.1 Proof of Correctness 

In this section we show that Counter[w] is a counting network. Before examining the network itself, 
we present some simple lemmas about the step property. 

Lemma 3.1 If a sequence has the step property, then so do all its subsequences.



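Lemma 3.2 If $x_0, \ldots, x_{k-1}$ has the step property, then

$$\sum_{i=0}^{k/2-1} x_{2i} = \left\lceil \frac{1}{2} \sum_{i=0}^{k-1} x_i \right\rceil \qquad \text{and} \qquad \sum_{i=0}^{k/2-1} x_{2i+1} = \left\lfloor \frac{1}{2} \sum_{i=0}^{k-1} x_i \right\rfloor.$$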
Lemma 3.3 Let $x_0, \ldots, x_{k-1}$ and $y_0, \ldots, y_{k-1}$ be arbitrary sequences having the step property. If
$\sum_{i=0}^{k-1} x_i = \sum_{i=0}^{k-1} y_i$, then $x_i = y_i$ for all $0 \le i < k$.



Lemma 3.4 Let $x_0, \ldots, x_{k-1}$ and $y_0, \ldots, y_{k-1}$ be arbitrary sequences having the step property. If
$\sum_{i=0}^{k-1} x_i = \sum_{i=0}^{k-1} y_i + 1$, then there exists a unique j, $0 \le j < k$, such that $x_j = y_j + 1$, and
$x_i = y_i$ for $i \ne j$, $0 \le i < k$.
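These lemmas about step sequences are easy to sanity-check numerically. The sketch below is illustrative only: it generates the step sequence of length k and sum m from the closed form in Lemma 2.1 (which is also why two step sequences with equal sums must coincide, as Lemma 3.3 states), and it assumes k is even so that the even and odd subsequences have equal length. The function names are ours.

```python
def step_sequence(m, k):
    """The length-k step sequence with sum m: x_i = ceil((m - i) / k)."""
    return [-((i - m) // k) for i in range(k)]


def check_lemmas(k, max_m=40):
    """Check Lemmas 3.2 and 3.4 for all sums m < max_m (k assumed even)."""
    for m in range(max_m):
        x = step_sequence(m, k)
        assert sum(x) == m
        # Lemma 3.2: even-indexed terms take the ceiling half of the sum,
        # odd-indexed terms take the floor half.
        assert sum(x[0::2]) == (m + 1) // 2
        assert sum(x[1::2]) == m // 2
        # Lemma 3.4: raising the sum by one changes exactly one position,
        # and that position grows by exactly one.
        bigger = step_sequence(m + 1, k)
        changed = [j for j in range(k) if bigger[j] != x[j]]
        assert len(changed) == 1
        assert bigger[changed[0]] == x[changed[0]] + 1
    return True
```

For example, with k = 4 the sums 5, 6, and 7 give the sequences 2,1,1,1 then 2,2,1,1 then 2,2,2,1, each differing from the previous one in a single position.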



We now show that the Merger[w] network preserves the step property.
Lemma 3.5 If Merger[2k] is quiescent, and its inputs $x_0, \ldots, x_{k-1}$ and $x'_0, \ldots, x'_{k-1}$ both have the
step property, then its outputs $y_0, \ldots, y_{2k-1}$ have the step property.

Proof outline: We argue by induction on log k. 

If 2k = 2, Merger[2k] is just a balancer, so its outputs are guaranteed to have the step property
by the definition of a balancer.

If 2k > 2, let $z_0, \ldots, z_{k-1}$ be the outputs of the first Merger[k] subnetwork, which merges the even
subsequence of x with the odd subsequence of x', and let $z'_0, \ldots, z'_{k-1}$ be the outputs of the second.
Since x and x' have the step property by assumption, so do their even and odd subsequences (Lemma
3.1), and hence so do z and z' (induction hypothesis). Furthermore,
$\sum z_i = \lceil \sum x_i/2 \rceil + \lfloor \sum x'_i/2 \rfloor$ and
$\sum z'_i = \lfloor \sum x_i/2 \rfloor + \lceil \sum x'_i/2 \rceil$ (Lemma 3.2). A straightforward case analysis shows that
$\sum z_i$ and $\sum z'_i$ can differ by at most 1.

We claim that $0 \le y_i - y_j \le 1$ for any $i < j$. If $\sum z_i = \sum z'_i$, then Lemma 3.3 implies that $z_i = z'_i$
for $0 \le i < k$. After the final layer of balancers,

$y_i - y_j = z_{\lfloor i/2 \rfloor} - z_{\lfloor j/2 \rfloor}$,

and the result follows because z has the step property. Similarly, if $\sum z_i$ and $\sum z'_i$ differ by one, Lemma
3.4 implies that $z_i = z'_i$ for $0 \le i < k$, except for a unique j such that $z_j$ and $z'_j$ differ by one. The
difference $y_i - y_j$ for any $i < j$ can then be expressed as the difference between earlier and later terms
either of z or of z', and the result follows because these two sequences both have the step property. ■

The proof of the following theorem is now immediate. 

Theorem 3.6 In any quiescent state, the outputs of Counter[w] have the step property. 

4 Applications 

We illustrate the utility of counting networks by constructing highly concurrent implementations of three 
common data structures: shared counters, producer/consumer buffers, and barriers. In Section 5 we 
give some experimental evidence that counting network implementations have higher throughput
than conventional implementations when contention is sufficiently high. 

4.1 Shared Counter 

A shared counter [6, 12, 7, 15, 20] issues the numbers 0 to $n - 1$ in response to the first n requests it
receives. To construct the counter, start with an arbitrary width-w counting network. Associate an
integer cell $c_i$ with the ith output wire. Initially, $c_i$ holds the value i. A process requests a number by
traversing the counting network. After it exits the network on wire i, it atomically adds w to the value
of $c_i$ and returns $c_i$'s previous value.
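The construction can be sketched for the smallest case, w = 2, where the counting network is a single balancer. This is an illustration only: the class and method names are hypothetical, and `threading.Lock` stands in for whatever atomic primitive an implementation would use for the toggle and for the fetch-and-add on each cell.

```python
import threading


class SharedCounter:
    """Shared counter over a width-2 counting network (one balancer)."""

    W = 2

    def __init__(self):
        self.toggle = 0
        self.toggle_lock = threading.Lock()
        self.cells = [0, 1]                     # cell c_i initially holds i
        self.cell_locks = [threading.Lock(), threading.Lock()]

    def _traverse(self):
        """Pass through the network; return the exit wire."""
        with self.toggle_lock:
            wire = self.toggle
            self.toggle ^= 1
        return wire

    def get_and_increment(self):
        i = self._traverse()
        with self.cell_locks[i]:                # atomically add w to c_i
            value = self.cells[i]
            self.cells[i] += self.W
        return value                            # c_i's previous value


def run_demo(num_threads=4, per_thread=100):
    counter, results, results_lock = SharedCounter(), [], threading.Lock()

    def worker():
        for _ in range(per_thread):
            v = counter.get_and_increment()
            with results_lock:
                results.append(v)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

With a single balancer every request still passes through one toggle, so this width offers no relief from contention; the point of a wider network is that the cells and balancers are spread over many memory locations.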

Lemma 2.2 implies that: 

Lemma 4.1 Let x be the largest number yet returned by any operation on the counter. Let S be the 
set of numbers less than x which have not been returned by any operation on the counter. Then 
1. The size of S is no greater than the number of operations still in progress.

2. If $y \in S$, then $y \ge x - w|S|$.

3. Each number in S will be returned by some operation in time $\Delta \cdot d + \Delta_c$, where d is the depth of
the network, $\Delta$ is the maximum gate delay, and $\Delta_c$ is the maximum time to update a cell on an
output wire.

4.2 Producer/Consumer Buffer 

A producer/consumer buffer is a data structure in which items inserted by a pool of m producer processes
are removed by a pool of m consumer processes. The buffer algorithm used here is essentially that of
Gottlieb, Lubachevsky, and Rudolph [12]. The buffer is an n-element circular array. There are two
m-process counting networks, a producer network and a consumer network. A producer starts by
traversing the producer network, leaving the network with value i. It then atomically inspects the ith
buffer element, and, if it is ⊥, replaces it with the produced item. If that position is full, then the
producer waits for the item to be consumed (or returns an exception). Similarly, a consumer traverses
the consumer network, exits on wire j, and if the jth position holds an item, atomically replaces it
with ⊥. If there is no item to consume, the consumer waits for an item to be produced (or returns an
exception).
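The buffer protocol can be sketched as follows, showing the waiting variant rather than the exception variant. Several things here are assumptions made for illustration: `TicketCounter` is a lock-based stand-in for a counter built from a counting network, the value obtained from it is reduced modulo n to pick a slot in the circular array, `None` plays the role of ⊥, and a `threading.Condition` per slot provides the atomic inspect-and-replace step.

```python
import threading

EMPTY = None   # plays the role of the empty marker


class TicketCounter:
    """Stand-in for a counting-network-based shared counter (assumption)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 0

    def take(self):
        with self._lock:
            value = self._next
            self._next += 1
        return value


class Buffer:
    def __init__(self, n):
        self.n = n
        self.items = [EMPTY] * n
        self.conds = [threading.Condition() for _ in range(n)]
        self.producers = TicketCounter()
        self.consumers = TicketCounter()

    def produce(self, item):
        i = self.producers.take() % self.n
        with self.conds[i]:
            while self.items[i] is not EMPTY:    # position full: wait
                self.conds[i].wait()
            self.items[i] = item
            self.conds[i].notify_all()

    def consume(self):
        j = self.consumers.take() % self.n
        with self.conds[j]:
            while self.items[j] is EMPTY:        # nothing to consume: wait
                self.conds[j].wait()
            item, self.items[j] = self.items[j], EMPTY
            self.conds[j].notify_all()
        return item


def run_demo(num_pairs=3, per_thread=20, n=4):
    buf, consumed, consumed_lock = Buffer(n), [], threading.Lock()

    def producer(p):
        for k in range(per_thread):
            buf.produce((p, k))

    def consumer():
        for _ in range(per_thread):
            item = buf.consume()
            with consumed_lock:
                consumed.append(item)

    threads = [threading.Thread(target=producer, args=(p,)) for p in range(num_pairs)]
    threads += [threading.Thread(target=consumer) for _ in range(num_pairs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(consumed)
```

Producers and consumers only meet at individual slots, so operations on different slots proceed independently; items are delivered exactly once but not necessarily in insertion order.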

Lemma 2.2 implies that: 

Lemma 4.2 Suppose m producers and m' consumers have entered a producer/consumer buffer built
out of counting networks of depth d and maximum gate delay $\Delta$. Assume that the time to update each
buffer element $b_i$ once a process has left the counting network is negligible. Then if $m < m'$, every
producer leaves the network in time $2d\Delta$ and the network reaches a quiescent state. Similarly if $m > m'$,
every consumer leaves the network in time $2d\Delta$ and the network reaches a quiescent state.

5 Performance 

The following is a summary of the more complete performance analysis provided in the full paper. 

We consider the performance of the network when each processor is assigned a fixed input wire, 
ensuring that the number of input tokens that can arrive simultaneously at an input wire is bounded. 
The network saturation S is defined to be the expected number of tokens at each balancer. For the
Counter[w] network, $S = 2n/(wd)$, where n tokens are spread over the $wd/2$ balancers of a network of
width w and depth d. The network is oversaturated if S > 1, and undersaturated if S < 1.
This measure is motivated by the assumption that in a sufficiently long computation, tokens are likely
to be spread through the network in an approximately uniform distribution.
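As a worked example of the saturation measure, the sketch below evaluates S for the Counter[w] networks of Section 3, whose depth is (log w)(log w + 1)/2. It assumes n is the number of tokens concurrently in the network, taken here to equal the number of processes; the function names are ours.

```python
from math import log2


def counter_depth(w):
    """Depth of Counter[w]: (log w)(log w + 1) / 2 layers (w a power of 2)."""
    lg = int(log2(w))
    return lg * (lg + 1) // 2


def saturation(n, w):
    """S = 2n / (w d): n tokens spread over the w*d/2 balancers."""
    return 2 * n / (w * counter_depth(w))


for w in (4, 8, 16):
    d = counter_depth(w)
    balancers = w * d // 2
    print(w, d, balancers, [round(saturation(n, w), 3) for n in (1, 6, 18)])
```

Under these assumptions a width-4 network reaches S = 1 at n = 6, while widths 8 and 16 stay below 1 even with 18 concurrent tokens.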

Define the contention at a balancer at a given time to be the number of tokens pending on its input
wires. An oversaturated network represents a full pipeline, hence its throughput is dominated by the
per-balancer contention, not by the network depth. If a balancer with S tokens makes a transition
in time $\Delta(S)$, then approximately w tokens emerge from the network every $\Delta(S)$ time units, yielding
a throughput of $w/\Delta(S)$. $\Delta$ is an increasing function whose exact form depends on the particular
architecture, but similar measures of degradation have been observed in practice to grow linearly or
worse [3, 16]. The throughput of an oversaturated network is therefore maximized by choosing w and
d to minimize S, bringing it as close as possible to 1.
The throughput of an undersaturated network is dominated by the network depth, not by the
per-balancer contention, since the network pipeline is partially empty. Every $O(1/S)$ time units, w
tokens leave the network, yielding throughput $O(wS)$. The throughput of an undersaturated network
is therefore maximized by choosing w and d to increase S, bringing it as close as possible to 1.

We implemented several data structures employing counting networks, as well as more conventional 
implementations using spin locks (which can be considered degenerate counting networks of width one). 
These implementations were done on an Encore Multimax, using Mul-T [13], a parallel dialect of Lisp. 
The spin lock is a simple "test-and-test-and-set" loop [17] written in assembly language, and provided 
by the Mul-T run-time system. Each balancer is protected by a single spin lock. 

We compare four shared counter implementations, counting networks of widths 16, 8, and 4, and a 
conventional spin lock implementation. For each network, we measured the elapsed time necessary for
$2^{20}$ (approximately a million) tokens to traverse the network, controlling the level of concurrency.

The width-16 network has 80 balancers, the width-8 network has 24 balancers, and the width-4 
network has 6 balancers. In Figure 5 the horizontal axis represents the number of processes executing 
concurrently. The vertical axis represents the elapsed time (in seconds) until all $2^{20}$ tokens had traversed
the network. With no concurrency, the networks are heavily undersaturated, and the spin lock's
throughput is the highest by far. As saturation increases, however, so does the throughput for each of 
the networks. The width-4 network is undersaturated at concurrency levels less than 6. As the level 
of concurrency increases from 1 to 6, saturation approaches 1, and throughput increases as the elapsed 
time decreases. Beyond 6, saturation increases beyond 1, and throughput eventually starts to decrease. 
The other networks remain undersaturated for the range of the experiment; their throughputs continue 
to improve. Notice that as the level of concurrency increases, the spin lock's throughput degrades in 
an approximately linear fashion. 

5.1 Producer/Consumer Buffers 

Next, we compare the performance of several producer/consumer buffers. Each implementation has 8 
producer processes and 8 consumer processes. We consider buffers with networks of width 8, 4, and 2. 
The width-2 implementation is simply a pair of counters protected by spin locks. As a final control, 
we tested a circular buffer protected by a single spin lock, a structure that permits no concurrency 
between producers and consumers. Figure 5 shows the time in seconds needed to produce and consume
$2^{20}$ tokens. Not surprisingly, the single spin-lock implementation is much slower than any of the others.
The width-2 network is heavily oversaturated, the bitonic width-4 network is slightly oversaturated, 
while the others are undersaturated. 
Figure 5: Producer/Consumer Buffer Implementations